Results 1 - 20 of 308
1.
PLoS Comput Biol ; 18(1): e1009628, 2022 01.
Article in English | MEDLINE | ID: mdl-35025869

ABSTRACT

Genome-wide association studies rely on the statistical inference of untyped variants, called imputation, to increase the coverage of genotyping arrays. However, the results are often suboptimal in populations underrepresented in existing reference panels and array designs, since the selected single nucleotide polymorphisms (SNPs) may fail to capture population-specific haplotype structures, hence the full extent of common genetic variation. Here, we propose to sequence the full genomes of a small subset of an underrepresented study cohort to inform the selection of population-specific add-on tag SNPs and to generate an internal population-specific imputation reference panel, such that the remaining array-genotyped cohort could be more accurately imputed. Using a Tanzania-based cohort as a proof-of-concept, we demonstrate the validity of our approach by showing improvements in imputation accuracy after the addition of our designed add-on tags to the base H3Africa array.


Subject(s)
Genetics, Population , Genome-Wide Association Study , Genotype , Polymorphism, Single Nucleotide/genetics , Computational Biology/methods , Genetics, Population/methods , Genetics, Population/standards , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Humans , Male , Tanzania
2.
Mol Genet Genomics ; 297(1): 33-46, 2022 Jan.
Article in English | MEDLINE | ID: mdl-34755217

ABSTRACT

Based on molecular markers, genomic prediction enables us to speed up breeding schemes and increase the response to selection. There are several high-throughput genotyping platforms able to deliver thousands of molecular markers for genomic study purposes. However, even though it is widely applied in plant breeding, species without a reference genome cannot fully benefit from genomic tools and modern breeding schemes. We used a method to assemble a population-tailored mock genome to call single-nucleotide polymorphism (SNP) markers without an available reference genome, and for the first time, we compared the results with standard genotyping platforms (array and genotyping-by-sequencing (GBS) using a reference genome) for performance in genomic prediction models. Our results indicate that using a population-tailored mock genome to call SNPs delivers reliable estimates for the genomic relationship between genotypes. Furthermore, genomic prediction estimates were comparable to standard approaches, especially when considering only additive effects. However, mock genomes were slightly worse than arrays at predicting traits influenced by dominance effects, but still performed as well as standard GBS methods that use a reference genome. Nevertheless, the array-based SNP marker methods achieved the best predictive ability and reliability to estimate variance components. Overall, the mock genomes can be a worthy alternative for genomic selection studies, especially for those species where the reference genome is not available.


Subject(s)
Computational Biology , Genotyping Techniques , Models, Genetic , Animals , Chimera/genetics , Computational Biology/methods , Computational Biology/standards , Datasets as Topic , Genome , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Genomics/methods , Genomics/standards , Genotype , Genotyping Techniques/methods , Genotyping Techniques/standards , Phenotype , Reference Standards , Reproducibility of Results , Selection, Genetic , Species Specificity , Zea mays/classification , Zea mays/genetics
3.
PLoS Genet ; 17(12): e1009944, 2021 12.
Article in English | MEDLINE | ID: mdl-34941872

ABSTRACT

High-throughput genotyping of large numbers of lines remains a key challenge in plant genetics, requiring geneticists and breeders to find a balance between data quality and the number of genotyped lines under a variety of different existing genotyping technologies when resources are limited. In this work, we propose a new imputation pipeline ("HBimpute") that can be used to generate high-quality genomic data from low read-depth whole-genome-sequence data. The key idea of the pipeline is the use of haplotype blocks from the software HaploBlocker to identify locally similar lines and subsequently use the reads of all locally similar lines in the variant calling for a specific line. The effectiveness of the pipeline is showcased on a dataset of 321 doubled haploid lines of a European maize landrace, which were sequenced at 0.5X read-depth. The overall imputing error rates are cut in half compared to state-of-the-art software like BEAGLE and STITCH, while the average read-depth is increased to 83X, thus enabling the calling of copy number variation. The usefulness of the obtained imputed data panel is further evaluated by comparing the performance of sequence data in common breeding applications to that of genomic data generated with a genotyping array. For both genome-wide association studies and genomic prediction, results are on par or even slightly better than results obtained with high-density array data (600k). In particular for genomic prediction, we observe slightly higher data quality for the sequence data compared to the 600k array in the form of higher prediction accuracies. This occurred specifically when reducing the data panel to the set of overlapping markers between sequence and array, indicating that sequencing data can benefit from the same marker ascertainment as used in the array process to increase the quality and usability of genomic data.
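
The pooling idea in this abstract (using the reads of locally similar lines when calling a variant for one line) can be illustrated with a minimal sketch. This is not the HBimpute implementation: the function name, the data layout, and the simple majority vote that stands in for a real variant caller are all assumptions made for illustration.

```python
def pooled_genotype_call(target, similar, read_counts):
    """Call an allele for `target` by pooling reads from similar lines.

    read_counts -- dict: line name -> (reference reads, alternate reads)
    similar     -- lines assumed to share the target's haplotype block here
    Returns "ref", "alt", or None when the pooled reads are tied or absent.
    Hypothetical majority vote, standing in for a real variant caller.
    """
    ref = alt = 0
    for line in [target, *similar]:
        r, a = read_counts.get(line, (0, 0))
        ref += r
        alt += a
    if ref == alt:
        return None
    return "ref" if ref > alt else "alt"


# Toy data: DH1 has no reads at this site, but two similar lines do.
read_counts = {"DH1": (0, 0), "DH2": (0, 1), "DH3": (0, 2), "DH4": (1, 0)}
call = pooled_genotype_call("DH1", ["DH2", "DH3"], read_counts)
```

In this toy case the target line has zero coverage on its own, and the pooled reads from the two similar lines are what make a call possible.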


Subject(s)
Genome-Wide Association Study/standards , Genotyping Techniques , Haplotypes/genetics , Software , DNA Copy Number Variations/genetics , Genome/genetics , Genomics/methods , Genotype , Polymorphism, Single Nucleotide/genetics , Whole Genome Sequencing , Zea mays/genetics
4.
Sci Rep ; 11(1): 19571, 2021 10 01.
Article in English | MEDLINE | ID: mdl-34599249

ABSTRACT

Ongoing increases in the size of human genotype and phenotype collections offer the promise of improved understanding of the genetics of complex diseases. In addition to the biological insights that can be gained from the nature of the variants that contribute to the genetic component of complex trait variability, these data bring forward the prospect of predicting complex traits and the risk of complex genetic diseases from genotype data. Here we show that advances in phenotype prediction can be applied to improve the power of genome-wide association studies. We demonstrate a simple and efficient method to model genetic background effects using polygenic scores derived from SNPs that are not on the same chromosome as the target SNP. Using simulated and real data we found that this can result in a substantial increase in the number of variants passing genome-wide significance thresholds. This increase in power to detect trait-associated variants also translates into an increase in the accuracy with which the resulting polygenic score predicts the phenotype from genotype data. Our results suggest that advances in methods for phenotype prediction can be exploited to improve the control of background genetic effects, leading to more accurate GWAS results and further improvements in phenotype prediction.
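
A polygenic score built only from SNPs on other chromosomes than the target SNP can be sketched as follows. This is a minimal illustration, not the authors' method: the function name, the data layout, and the effect sizes are hypothetical.

```python
def loco_polygenic_score(genotypes, effects, chromosomes, target_chrom):
    """Polygenic score that skips every SNP on the target SNP's chromosome.

    genotypes   -- allele counts (0, 1 or 2) for one individual, one per SNP
    effects     -- per-SNP effect sizes (hypothetical values here)
    chromosomes -- chromosome label of each SNP
    """
    return sum(
        g * b
        for g, b, c in zip(genotypes, effects, chromosomes)
        if c != target_chrom
    )


# Toy data: four SNPs on chromosomes 1, 1, 2 and 3.
genotypes = [2, 0, 1, 1]
effects = [0.5, -0.25, 1.0, 0.25]
chromosomes = [1, 1, 2, 3]

# Background score for a target SNP on chromosome 1 uses only chr 2 and 3.
score = loco_polygenic_score(genotypes, effects, chromosomes, target_chrom=1)
```

One plausible use, stated here as an assumption, is to enter such a score as a covariate when testing the target SNP, so that the background signal does not include the SNP under test or its linked neighbours.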


Subject(s)
Genetic Background , Genome-Wide Association Study , Models, Genetic , Multifactorial Inheritance , Phenotype , Area Under Curve , Biological Specimen Banks , Genetic Predisposition to Disease , Genetic Variation , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Humans , Polymorphism, Single Nucleotide , Quantitative Trait Loci , ROC Curve , United Kingdom
5.
Genetica ; 149(5-6): 313-325, 2021 Dec.
Article in English | MEDLINE | ID: mdl-34480683

ABSTRACT

Reducing false discoveries caused by population stratification (PS) has always been a challenge in genome-wide association studies (GWAS). The current literature has established several single-marker approaches including genomic control (GC), EIGENSTRAT and generalized linear mixed model association test (GMMAT) and multi-marker methods such as LASSO mixed model (LASSOMM). However, the single-marker methods require prespecifying an arbitrary p value threshold in the selection process, likely resulting in suboptimal precision or recall. On the other hand, it appears that LASSOMM is extremely computationally intensive and may not be suitable for large-scale GWAS. In this paper, we propose a simple multi-marker approach (PCA-LASSO) combining principal component analysis (PCA) and least absolute shrinkage and selection operator (LASSO). We utilize PCA to correct for the confounding effects of PS and LASSO with built-in cross-validation for a data-driven selection. Compared to the current single-marker approaches, the proposed PCA-LASSO provides an optimal balance between precision and recall, and consequently superior F1 scores. Similarly, compared to LASSOMM, PCA-LASSO markedly increases the precision while minimizing the loss of recall, and therefore improves the overall F1 score in the presence of PS. More importantly, PCA-LASSO reduces the computational time by a factor of more than 1000 when compared to LASSOMM. We applied PCA-LASSO to a real dataset of Alzheimer's disease and successfully identified SNP rs429358 (Gene APOE4) which has been widely reported to be associated with the onset and elevated risk of Alzheimer's disease. In conclusion, PCA-LASSO is a simple, fast, but accurate approach for GWAS in the presence of latent PS.
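
The abstract compares methods by precision, recall, and F1 score. F1 is the harmonic mean of precision and recall; the short sketch below computes all three from counts of selected markers. The function name and the counts are hypothetical and are not taken from the paper.

```python
def precision_recall_f1(true_pos, false_pos, false_neg):
    """Return (precision, recall, F1) from counts of selected markers.

    Assumes at least one true positive, so no denominator is zero.
    """
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical counts: 8 causal SNPs found, 2 spurious, 2 causal missed.
precision, recall, f1 = precision_recall_f1(true_pos=8, false_pos=2, false_neg=2)
```

Because F1 is a harmonic mean, a method that gains precision at a large cost in recall (or the reverse) is penalized, which is why the abstract reports the balance between the two.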


Subject(s)
Genetic Predisposition to Disease , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Alzheimer Disease/genetics , Datasets as Topic , Genomics , Humans , Principal Component Analysis , Time Factors
6.
Brief Bioinform ; 22(6)2021 11 05.
Article in English | MEDLINE | ID: mdl-34459489

ABSTRACT

In genome-wide association studies (GWAS), it has become commonplace to test millions of single-nucleotide polymorphisms (SNPs) for phenotypic association. Gene-based testing can improve power to detect weak signal by reducing multiple testing and pooling signal strength. While such tests account for linkage disequilibrium (LD) structure of SNP alleles within each gene, current approaches do not capture LD of SNPs falling in different nearby genes, which can induce correlation of gene-based test statistics. We introduce an algorithm to account for this correlation. When a gene's test statistic is independent of others, it is assessed separately; when test statistics for nearby genes are strongly correlated, their SNPs are agglomerated and tested as a locus. To provide insight into SNPs and genes driving association within loci, we develop an interactive visualization tool to explore localized signal. We demonstrate our approach in the context of weakly powered GWAS for autism spectrum disorder, which is contrasted to more highly powered GWAS for schizophrenia and educational attainment. To increase power for these analyses, especially those for autism, we use adaptive P-value thresholding, guided by high-dimensional metadata modeled with gradient boosted trees, highlighting when and how it can be most useful. Notably, our workflow is based on summary statistics.
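
The agglomeration rule described in this abstract (test a gene alone when its statistic is independent, merge nearby genes when their statistics are strongly correlated) can be sketched as below. This is only an illustration under stated assumptions: the function name, the restriction to adjacent genes, and the 0.8 cutoff are hypothetical, and the authors' algorithm may differ.

```python
def agglomerate_genes(genes, neighbor_corr, threshold=0.8):
    """Group consecutive genes into loci.

    genes         -- gene names ordered by chromosomal position
    neighbor_corr -- neighbor_corr[i] is the correlation between the test
                     statistics of genes[i] and genes[i + 1]
    threshold     -- hypothetical cutoff above which neighbors are merged
    """
    if not genes:
        return []
    loci = [[genes[0]]]
    for gene, corr in zip(genes[1:], neighbor_corr):
        if abs(corr) >= threshold:
            loci[-1].append(gene)   # strongly correlated: same locus
        else:
            loci.append([gene])     # independent: start a new locus
    return loci


loci = agglomerate_genes(["G1", "G2", "G3", "G4"], [0.9, 0.1, 0.85])
```

Each resulting group would then be tested once as a locus, which reduces the number of tests relative to testing every gene separately.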


Subject(s)
Algorithms , Computational Biology/methods , Genetic Predisposition to Disease , Genetic Testing/standards , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Alleles , Chromosome Mapping , Databases, Genetic , Genetic Testing/methods , Humans , Linkage Disequilibrium , Phenotype , Polymorphism, Single Nucleotide , Quantitative Trait Loci
7.
Genet Sel Evol ; 53(1): 64, 2021 Jul 29.
Article in English | MEDLINE | ID: mdl-34325663

ABSTRACT

BACKGROUND: With the completion of a single nucleotide polymorphism (SNP) chip for honey bees, the technical basis of genomic selection is laid. However, for its application in practice, methods to estimate genomic breeding values need to be adapted to the specificities of the genetics and breeding infrastructure of this species. Drone-producing queens (DPQ) are used for mating control, and usually, they head non-phenotyped colonies that will be placed on mating stations. Breeding queens (BQ) head colonies that are intended to be phenotyped and used to produce new queens. Our aim was to evaluate different breeding program designs for the initiation of genomic selection in honey bees. METHODS: Stochastic simulations were conducted to evaluate the quality of the estimated breeding values. We developed a variation of the genomic relationship matrix to include genotypes of DPQ and tested different sizes of the reference population. The results were used to estimate genetic gain in the initial selection cycle of a genomic breeding program. This program was run over six years, and different numbers of genotyped queens per year were considered. Resources could be allocated to increase the reference population, or to perform genomic preselection of BQ and/or DPQ. RESULTS: Including the genotypes of 5000 phenotyped BQ increased the accuracy of predictions of breeding values by up to 173%, depending on the size of the reference population and the trait considered. To initiate a breeding program, genotyping a minimum number of 1000 queens per year is required. In this case, genetic gain was highest when genomic preselection of DPQ was coupled with the genotyping of 10-20% of the phenotyped BQ. For maximum genetic gain per used genotype, more than 2500 genotyped queens per year and preselection of all BQ and DPQ are required. 
CONCLUSIONS: This study shows that the first priority in a breeding program is to genotype phenotyped BQ to obtain a sufficiently large reference population, which allows successful genomic preselection of queens. To maximize genetic gain, DPQ should be preselected, and their genotypes included in the genomic relationship matrix. We suggest that the developed methods for genomic prediction are suitable for implementation in genomic honey bee breeding programs.


Subject(s)
Bees/genetics , Models, Genetic , Selective Breeding , Animals , Genome, Insect , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Genotyping Techniques/methods
8.
Genet Sel Evol ; 53(1): 46, 2021 May 31.
Article in English | MEDLINE | ID: mdl-34058971

ABSTRACT

BACKGROUND: In dairy cattle populations in which crossbreeding has been used, animals show some level of diversity in their origins. In rotational crossbreeding, for instance, crossbred dams are mated with purebred sires from different pure breeds, and the genetic composition of crossbred animals is an admixture of the breeds included in the rotation. How to use the data of such individuals in genomic evaluations is still an open question. In this study, we aimed at providing methodologies for the use of data from crossbred individuals with an admixed genetic background together with data from multiple pure breeds, for the purpose of genomic evaluations for both purebred and crossbred animals. A three-breed rotational crossbreeding system was mimicked using simulations based on animals genotyped with the 50 K single nucleotide polymorphism (SNP) chip. RESULTS: For purebred populations, within-breed genomic predictions generally led to higher accuracies than those from multi-breed predictions using combined data of pure breeds. Adding admixed population's (MIX) data to the combined pure breed data considering MIX as a different breed led to higher accuracies. When prediction models were able to account for breed origin of alleles, accuracies were generally higher than those from combining all available data, depending on the correlation of quantitative trait loci (QTL) effects between the breeds. Accuracies varied when using SNP effects from any of the pure breeds to predict the breeding values of MIX. Using those breed-specific SNP effects that were estimated separately in each pure breed, while accounting for breed origin of alleles for the selection candidates of MIX, generally improved the accuracies. Models that are able to accommodate MIX data with the breed origin of alleles approach generally led to higher accuracies than models without breed origin of alleles, depending on the correlation of QTL effects between the breeds. 
CONCLUSIONS: Combining all available data, pure breeds' and admixed population's data, in a multi-breed reference population is beneficial for the estimation of breeding values for pure breeds with a small reference population. For MIX, such an approach can lead to higher accuracies than considering breed origin of alleles for the selection candidates, and using breed-specific SNP effects estimated separately in each pure breed. When MIX data are included in the reference population of multiple breeds by considering the breed origin of alleles, accuracies can be further improved. Our findings are relevant for breeding programs in which crossbreeding is systematically applied, and also for populations that involve different subpopulations and between which exchange of genetic material is routine practice.


Subject(s)
Cattle/genetics , Hybridization, Genetic , Polymorphism, Single Nucleotide , Animals , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Inbreeding , Models, Genetic , Quantitative Trait Loci , Reference Standards , Selective Breeding
9.
Genes (Basel) ; 12(6)2021 05 27.
Article in English | MEDLINE | ID: mdl-34071952

ABSTRACT

Description of a perpetrator's eye colour can be an important investigative lead in a forensic case with no apparent suspects. Herein, we present 11 SNPs (Eye Colour 11, EC11) that are important for eye colour prediction and eye colour prediction models for a two-category reporting system (blue and brown) and a three-category system (blue, intermediate, and brown). The EC11 SNPs were carefully selected from 44 pigmentary variants in seven genes previously found to be associated with eye colours in 757 Europeans (Danes, Swedes, and Italians). Mathematical models using three different reporting systems: a quantitative system (PIE-score), a two-category system (blue and brown), and a three-category system (blue, intermediate, brown) were used to rank the variants. SNPs with a sufficient mean variable importance (above 0.3%) were selected for EC11. Eye colour prediction models using the EC11 SNPs were developed using leave-one-out cross-validation (LOOCV) in an independent data set of 523 Norwegian individuals. Performance of the EC11 models for the two- and three-category system was compared with models based on the IrisPlex SNPs and the most important eye colour locus, rs12913832. We also compared model performances with the IrisPlex online tool (IrisPlex Web). The EC11 eye colour prediction models performed slightly better than the IrisPlex and rs12913832 models in all reporting systems and better than the IrisPlex Web in the three-category system. Three important points to consider prior to the implementation of eye colour prediction in a forensic genetic setting are discussed: (1) the reference population, (2) the SNP set, and (3) the reporting strategy.
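
Leave-one-out cross-validation, as mentioned in this abstract, holds out each individual once, fits the model on the rest, and scores the held-out prediction. The sketch below shows the procedure with a toy nearest-centroid classifier on one SNP. The classifier, the function names, and the data are hypothetical and unrelated to the EC11 models.

```python
def loocv_accuracy(features, labels, fit, predict):
    """Leave-one-out cross-validation: hold out each sample once."""
    correct = 0
    for i in range(len(features)):
        train_x = features[:i] + features[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = fit(train_x, train_y)
        correct += predict(model, features[i]) == labels[i]
    return correct / len(features)


def fit_centroids(xs, ys):
    """Toy model: mean feature value per class."""
    sums = {}
    for x, y in zip(xs, ys):
        total, count = sums.get(y, (0.0, 0))
        sums[y] = (total + x, count + 1)
    return {y: total / count for y, (total, count) in sums.items()}


def predict_nearest(centroids, x):
    """Assign the class whose centroid is closest to x."""
    return min(centroids, key=lambda y: abs(centroids[y] - x))


# Hypothetical data: count of a blue-associated allele vs. reported colour.
allele_counts = [0, 0, 1, 1, 2, 2]
colours = ["brown", "brown", "brown", "blue", "blue", "blue"]
accuracy = loocv_accuracy(allele_counts, colours, fit_centroids, predict_nearest)
```

In this toy data the two heterozygous individuals are misclassified when held out, so the LOOCV accuracy is lower than the accuracy the same model would show on its own training data.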


Subject(s)
Eye Color/genetics , Polymorphism, Single Nucleotide , Forensic Genetics/methods , Forensic Genetics/standards , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Humans , Models, Genetic , Phenotype , Scandinavian and Nordic Countries
10.
Eur J Hum Genet ; 29(11): 1611-1624, 2021 11.
Article in English | MEDLINE | ID: mdl-34140649

ABSTRACT

Array technology to genotype single-nucleotide variants (SNVs) is widely used in genome-wide association studies (GWAS), clinical diagnostics, and linkage studies. Arrays have undergone a tremendous growth in both number and content over recent years making a comprehensive comparison all the more important. We have compared 28 genotyping arrays on their overall content, genome-wide coverage, imputation quality, presence of known GWAS loci, mtDNA variants and clinically relevant genes (i.e., American College of Medical Genetics (ACMG) actionable genes, pharmacogenetic genes, human leukocyte antigen (HLA) genes and SNV density). Our comparison shows that genome-wide coverage is highly correlated with the number of SNVs on the array but does not correlate with imputation quality, which is the main determinant of GWAS usability. Average imputation quality for all tested arrays was similar for European and African populations, indicating that this is not a good criterion for choosing a genotyping array. Rather, the additional content on the array, such as pharmacogenetics or HLA variants, should be the deciding factor. As the research question of a study will in large part determine which class of genes are of interest, there is not just one perfect array for all different research questions. This study can thus help as a guideline to determine which array best suits a study's requirements.


Subject(s)
Genetic Testing/standards , Genotyping Techniques/standards , Oligonucleotide Array Sequence Analysis/standards , Genetic Testing/methods , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Genotyping Techniques/methods , Humans , Oligonucleotide Array Sequence Analysis/methods , Reagent Kits, Diagnostic/standards , Sensitivity and Specificity
11.
Genet Sel Evol ; 53(1): 55, 2021 Jun 29.
Article in English | MEDLINE | ID: mdl-34187354

ABSTRACT

BACKGROUND: Mathematical models are needed for the design of breeding programs using genomic prediction. While deterministic models for selection on pedigree-based estimates of breeding values (PEBV) are available, these have not been fully developed for genomic selection, with a key missing component being the accuracy of genomic EBV (GEBV) of selection candidates. Here, a deterministic method was developed to predict this accuracy within a closed breeding population based on the accuracy of GEBV and PEBV in the reference population and the distance of selection candidates from their closest ancestors in the reference population. METHODS: The accuracy of GEBV was modeled as a combination of the accuracy of PEBV and of EBV based on genomic relationships deviated from pedigree (DEBV). Loss of the accuracy of DEBV from the reference to the target population was modeled based on the effective number of independent chromosome segments in the reference population (Me). Measures of Me derived from the inverse of the variance of relationships and from the accuracies of GEBV and PEBV in the reference population, derived using either a Fisher information or a selection index approach, were compared by simulation. RESULTS: Using simulation, both the Fisher and the selection index approach correctly predicted accuracy in the target population over time, both with and without selection. The index approach, however, resulted in estimates of Me that were less affected by heritability, reference size, and selection, and which are, therefore, more appropriate as a population parameter. The variance of relationships underpredicted Me and was greatly affected by selection. A leave-one-out cross-validation approach was proposed to estimate required accuracies of EBV in the reference population. Aspects of the methods were validated using real data. 
CONCLUSIONS: A deterministic method was developed to predict the accuracy of GEBV in selection candidates in a closed breeding population. The population parameter Me that is required for these predictions can be derived from an available reference data set, and applied to other reference data sets and traits for that population. This method can be used to evaluate the benefit of genomic prediction and to optimize genomic selection breeding programs.


Subject(s)
Models, Genetic , Selective Breeding , Animals , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Livestock/genetics , Pedigree , Polymorphism, Single Nucleotide , Quantitative Trait Loci
12.
Trends Genet ; 37(10): 868-871, 2021 10.
Article in English | MEDLINE | ID: mdl-34183185

ABSTRACT

For identification of marker-trait associations (MTAs) for complex traits in animals and plants, thousands of genome-wide association studies (GWAS) were conducted during the past two decades. This involved regular improvement in methodology. Initially, a reference genome and SNPs were used; more recently, pan-genomes and markers such as structural variations (SVs) and k-mers are also being used.


Subject(s)
Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Animals , Genome/genetics , Humans , Phenotype , Plants/genetics , Polymorphism, Single Nucleotide/genetics
13.
Nat Commun ; 12(1): 3506, 2021 06 09.
Article in English | MEDLINE | ID: mdl-34108454

ABSTRACT

In modern Whole Genome Sequencing (WGS) epidemiological studies, participant-level data from multiple studies are often pooled and results are obtained from a single analysis. We consider the impact of differential phenotype variances by study, which we term 'variance stratification'. Unaccounted for, variance stratification can lead to both decreased statistical power and increased false-positive rates, depending on how allele frequencies, sample sizes, and phenotypic variances vary across the studies that are pooled. We develop a procedure to compute variant-specific inflation factors, and show how it can be used for diagnosis of genetic association analyses on pooled individual-level data from multiple studies. We describe a WGS-appropriate analysis approach, implemented in freely-available software, which allows study-specific variances and thereby improves performance in practice. We illustrate the variance stratification problem, its solutions, and the proposed diagnostic procedure, in simulations and in data from the Trans-Omics for Precision Medicine Whole Genome Sequencing Program (TOPMed), used in association tests for hemoglobin concentrations and BMI.


Subject(s)
Genetic Variation , Genome-Wide Association Study/methods , Algorithms , Computer Simulation , Gene Frequency , Genome-Wide Association Study/standards , Genome-Wide Association Study/statistics & numerical data , Humans , Phenotype , Sample Size
14.
Genetica ; 149(3): 143-153, 2021 Jun.
Article in English | MEDLINE | ID: mdl-33963492

ABSTRACT

Genome-wide studies are prone to false positives due to inherently low priors and statistical power. One approach to ameliorate this problem is to seek validation of reported candidate genes across independent studies: genes with repeatedly discovered effects are less likely to be false positives. Inversely, genes reported only as many times as expected by chance alone, while possibly representing novel discoveries, are also more likely to be false positives. We show that, across over 30 genome-wide studies that reported Drosophila and Daphnia genes with possible roles in thermal adaptation, the combined lists of candidate genes and orthologous groups are rapidly approaching the total number of genes and orthologous groups in the respective genomes. This is consistent with the expectation of high frequency of false positives. The majority of these spurious candidates have been identified by one or a few studies, as expected by chance alone. In contrast, a noticeable minority of genes have been identified by numerous studies with the probabilities of such discoveries occurring by chance alone being exceedingly small. For this subset of genes, different studies are in agreement with each other despite differences in the ecological settings, genomic tools and methodology, and reporting thresholds. We provide a reference set of presumed true positives among Drosophila candidate genes and orthologous groups involved in response to changes in temperature, suitable for cross-validation purposes. Despite this approach being prone to false negatives, this list of presumed true positives includes several hundred genes, consistent with the "omnigenic" concept of genetic architecture of complex traits.
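
The reasoning in this abstract (a gene reported by many independent studies is unlikely to be a chance finding) can be made concrete with a binomial tail probability. The sketch below makes a strong simplifying assumption that is not from the paper: every study independently reports any given gene with the same probability p. The function name and all numbers are hypothetical.

```python
from math import comb


def prob_reported_at_least(k, n_studies, p):
    """P(X >= k) for X ~ Binomial(n_studies, p)."""
    return sum(
        comb(n_studies, i) * p**i * (1 - p) ** (n_studies - i)
        for i in range(k, n_studies + 1)
    )


# Hypothetical: 30 studies, each listing a random 5% of genes.
p_once = prob_reported_at_least(1, 30, 0.05)
p_six = prob_reported_at_least(6, 30, 0.05)
```

Under these assumptions, being reported at least once is the likely outcome for a random gene, while being reported six or more times is rare, which mirrors the contrast the abstract draws between spurious candidates and presumed true positives.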


Subject(s)
Genome-Wide Association Study/methods , Quantitative Trait Loci , Thermotolerance/genetics , Animals , Arthropods/genetics , Arthropods/physiology , False Positive Reactions , Genome-Wide Association Study/standards , Models, Genetic , Polymorphism, Genetic , Reference Standards
15.
Med Sci Sports Exerc ; 53(5): 883-887, 2021 05 01.
Article in English | MEDLINE | ID: mdl-33844668

ABSTRACT

It is clear, based on a deep scientific literature base, that genetic and genomic factors play significant roles in determining a wide range of sport and exercise characteristics including exercise endurance capacity, strength, daily physical activity levels, and trainability of both endurance and strength. Although the research field of exercise systems genetics has rapidly expanded over the past two decades, many researchers publishing in this field are not extensively trained in molecular biology or genomics techniques, sometimes creating gaps in generating high-quality and cutting-edge research for publication. As current or former Associate Editors for Medicine and Science in Sports and Exercise who have handled the majority of exercise genetics articles for Medicine and Science in Sports and Exercise in the past 15 yr, we have observed a large number of scientific manuscripts submitted for publication review that have exhibited significant flaws preventing their publication; flaws that often directly stem from a lack of knowledge regarding the "state-of-the-art" methods and accepted literature base that is rapidly changing as the field evolves. The purpose of this commentary is to provide researchers (especially those coming from a nongenetics background attempting to publish in the exercise systems genetics area) with recommendations regarding best-practice research standards and data analysis in the field of exercise systems genetics, to strengthen the overall literature in this important and evolving field of research.


Subject(s)
Exercise , Physiological Phenomena/genetics , Polymorphism, Single Nucleotide/genetics , Publishing/standards , Research/standards , Athletic Performance/physiology , Data Analysis , Genome-Wide Association Study/standards , Genotype , Humans , Muscle Strength/genetics , Phenotype , Physical Conditioning, Human , Physical Endurance/genetics , Quality Control , Reproducibility of Results , Research Design/standards , Reverse Transcriptase Polymerase Chain Reaction , Sample Size , Sports/physiology
16.
Genetics ; 217(3)2021 03 31.
Article in English | MEDLINE | ID: mdl-33789342

ABSTRACT

Ghost quantitative trait loci (QTL) are false discoveries in QTL mapping that arise due to the "accumulation" of the polygenic effects, uniformly distributed over the genome. The locations on the chromosome that are strongly correlated with the total of the polygenic effects depend on a specific sample correlation structure determined by the genotypes at all loci. The problem is particularly severe when the same genotypes are used to study multiple QTL, e.g. using recombinant inbred lines or studying the expression QTL. In this case, the ghost QTL phenomenon can lead to false hotspots, where multiple QTL show apparent linkage to the same locus. We illustrate the problem using the classic backcross design and suggest that it can be solved by the application of the extended mixed effect model, where the random effects are allowed to have a nonzero mean. We provide formulas for estimating the thresholds for the corresponding t-test statistics and use them in the stepwise selection strategy, which allows for a simultaneous detection of several QTL. Extensive simulation studies illustrate that our approach eliminates ghost QTL/false hotspots, while preserving a high power of true QTL detection.


Subject(s)
Crosses, Genetic , Models, Genetic , Multifactorial Inheritance , Quantitative Trait Loci , Animals , Breeding/methods , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Plants/genetics
17.
Am J Med Genet B Neuropsychiatr Genet ; 186(1): 16-27, 2021 01.
Article in English | MEDLINE | ID: mdl-33576176

ABSTRACT

Genotype imputation across populations of mixed ancestry is critical for optimal discovery in large-scale genome-wide association studies (GWAS). Methods for direct imputation of GWAS summary statistics were previously shown to be practically as accurate as summary statistics produced after raw genotype imputation, while incurring an orders-of-magnitude lower computational burden. Given that direct imputation needs a precise estimate of linkage disequilibrium (LD) and that most methods use a small reference panel (for example, the ~2,500 subjects from the 1000 Genomes Project), there is a great need for much larger and more diverse reference panels. To accurately estimate the LD needed for an exhaustive analysis of any cosmopolitan cohort, we developed DISTMIX2. DISTMIX2 (a) uses a much larger and more diverse reference panel than traditional reference panels, and (b) can estimate ethnic-mixture weights based solely on Z-scores when allele frequencies are not available. We applied DISTMIX2 to GWAS summary statistics from the Psychiatric Genomics Consortium (PGC). DISTMIX2 uncovered signals in numerous new regions, with most of these findings coming from rarer variants. Rarer variants localize signals much more sharply than common variants, because LD for rare variants extends over a shorter distance than for common ones. For example, while the original PGC post-traumatic stress disorder GWAS found only 3 marginal signals for common variants, we now uncover a very strong signal for a rare variant in PKN2, a gene associated with neuronal and hippocampal development. Thus, DISTMIX2 provides a robust and fast (re)imputation approach for most psychiatric GWAS.


Subject(s)
Genome-Wide Association Study/standards , Mental Disorders/diagnosis , Mental Disorders/genetics , Polymorphism, Single Nucleotide , Cohort Studies , Gene Frequency , Humans , Linkage Disequilibrium , Phenotype , Reference Standards , Software
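Illustrative sketch for record 17 (not part of the indexed record, and not the DISTMIX2 implementation). Direct imputation of summary statistics is commonly described as a conditional expectation: the Z-score of an untyped variant is predicted from the Z-scores of nearby typed variants, weighted by LD estimated in a reference panel. The function names, the ridge value, and the toy LD numbers below are assumptions made for illustration only.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def impute_z(ld_typed, ld_untyped_typed, z_typed, ridge=0.0):
    """Predict the Z-score of one untyped variant.

    ld_typed:         LD (correlation) matrix among typed variants
    ld_untyped_typed: LD between the untyped variant and each typed variant
    z_typed:          observed Z-scores at the typed variants
    ridge:            hypothetical regularization added to the diagonal
    """
    n = len(z_typed)
    reg = [[ld_typed[i][j] + (ridge if i == j else 0.0) for j in range(n)]
           for i in range(n)]
    weights = solve(reg, ld_untyped_typed)  # (LD_tt + ridge * I)^-1 LD_tu
    return sum(w * z for w, z in zip(weights, z_typed))


# Toy example: the untyped variant is in perfect LD with the second typed variant,
# so its imputed Z-score should match that variant's observed Z-score.
z_hat = impute_z([[1.0, 0.5], [0.5, 1.0]], [0.5, 1.0], [4.0, 1.0])
print(round(z_hat, 6))
```

The accuracy of such a prediction depends on how well the reference-panel LD matches the study cohort, which is the motivation the abstract gives for larger and more diverse panels.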
18.
Genome Res ; 31(4): 529-537, 2021 04.
Article in English | MEDLINE | ID: mdl-33536225

ABSTRACT

Low-pass sequencing (sequencing a genome to an average depth less than 1× coverage) combined with genotype imputation has been proposed as an alternative to genotyping arrays for trait mapping and calculation of polygenic scores. To empirically assess the relative performance of these technologies for different applications, we performed low-pass sequencing (targeting coverage levels of 0.5× and 1×) and array genotyping (using the Illumina Global Screening Array [GSA]) on 120 DNA samples derived from African- and European-ancestry individuals that are part of the 1000 Genomes Project. We then imputed both the sequencing data and the genotyping array data to the 1000 Genomes Phase 3 haplotype reference panel using a leave-one-out design. We evaluated overall imputation accuracy from these different assays as well as overall power for GWAS from imputed data and computed polygenic risk scores for coronary artery disease and breast cancer using previously derived weights. We conclude that low-pass sequencing plus imputation, in addition to providing a substantial increase in statistical power for genome-wide association studies, provides increased accuracy for polygenic risk prediction at effective coverages of ∼0.5× and higher compared to the Illumina GSA.


Subject(s)
Genome-Wide Association Study , Genotype , High-Throughput Nucleotide Sequencing , Genome, Human , Genome-Wide Association Study/methods , Genome-Wide Association Study/standards , Haplotypes , Humans , Risk Factors
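Illustrative sketch for record 18 (not part of the indexed record). A polygenic score computed "using previously derived weights" is, in its simplest form, a weighted sum of effect-allele dosages. The variant identifiers, dosages, and weights below are hypothetical, and the helper name is an assumption rather than anything taken from the study.

```python
def polygenic_score(dosages, weights):
    """Weighted sum of effect-allele dosages over variants present in both inputs.

    dosages: variant ID -> imputed effect-allele dosage (0.0 to 2.0)
    weights: variant ID -> per-allele effect size from an external study
    """
    shared = dosages.keys() & weights.keys()  # ignore variants missing on either side
    return sum(dosages[v] * weights[v] for v in shared)


# Hypothetical individual: imputed dosages for three variants.
dosages = {"var_a": 1.8, "var_b": 0.1, "var_c": 1.0}
# Hypothetical published weights; var_d was not imputed for this individual.
weights = {"var_a": 0.5, "var_b": -1.0, "var_d": 0.3}

score = polygenic_score(dosages, weights)
print(round(score, 3))
```

Because the score depends on imputed dosages, imputation accuracy carries through to the score, which is why the abstract compares the two assays on risk prediction as well as on imputation accuracy.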
19.
PLoS Comput Biol ; 17(2): e1007784, 2021 02.
Article in English | MEDLINE | ID: mdl-33606672

ABSTRACT

Rare variants are thought to play an important role in the etiology of complex diseases and may explain a significant fraction of the missing heritability in genetic disease studies. Next-generation sequencing facilitates the association of rare variants in coding or regulatory regions with complex diseases in large cohorts at genome-wide scale. However, rare variant association studies (RVAS) still lack power when cohorts are small to medium-sized and if genetic variation explains a small fraction of phenotypic variance. Here we present a novel Bayesian rare variant Association Test using Integrated Nested Laplace Approximation (BATI). Unlike existing RVAS tests, BATI allows integration of individual or variant-specific features as covariates, while efficiently performing inference based on full model estimation. We demonstrate that BATI outperforms established RVAS methods on realistic, semi-synthetic whole-exome sequencing cohorts, especially when using meaningful biological context, such as functional annotation. We show that BATI achieves power above 70% in scenarios in which competing tests fail to identify risk genes, e.g. when risk variants in sum explain less than 0.5% of phenotypic variance. We have integrated BATI, together with five existing RVAS tests in the 'Rare Variant Genome Wide Association Study' (rvGWAS) framework for data analyzed by whole-exome or whole genome sequencing. rvGWAS supports rare variant association for genes or any other biological unit such as promoters, while allowing the analysis of essential functionalities like quality control or filtering. Applying rvGWAS to a Chronic Lymphocytic Leukemia study we identified eight candidate predisposition genes, including EHMT2 and COPS7A.


Subject(s)
Genetic Variation , Genome-Wide Association Study/methods , Bayes Theorem , Benchmarking , Breast Neoplasms/genetics , COP9 Signalosome Complex/genetics , Case-Control Studies , Cohort Studies , Computational Biology , Computer Simulation , Data Interpretation, Statistical , Databases, Genetic , Female , Genetic Predisposition to Disease , Genome-Wide Association Study/standards , Genome-Wide Association Study/statistics & numerical data , Histocompatibility Antigens/genetics , Histone-Lysine N-Methyltransferase/genetics , Humans , Leukemia, Lymphocytic, Chronic, B-Cell/genetics , Quality Control , Risk Factors , Transcription Factors/genetics , Exome Sequencing/methods , Exome Sequencing/standards , Exome Sequencing/statistics & numerical data , Whole Genome Sequencing/methods , Whole Genome Sequencing/statistics & numerical data
20.
Eur J Hum Genet ; 29(5): 839-850, 2021 05.
Article in English | MEDLINE | ID: mdl-33500576

ABSTRACT

Recent studies use a lifestyle risk score (LRS), an aggregation of multiple lifestyle exposures, to identify gene-lifestyle interactions associated with disease traits. However, not all cohorts have data on all lifestyle factors, leading to increased heterogeneity in the environmental exposure in collaborative meta-analyses. We compared and evaluated four approaches (Naïve, Safe, Complete and Moderator Approaches) to handle the missingness in LRS-stratified meta-analyses under various scenarios. Compared to "benchmark" results with all lifestyle factors available for all cohorts, the Complete Approach, which included only cohorts with all lifestyle components, was underpowered due to lower sample size, and the Naïve Approach, which utilized all available data and ignored the missingness, was slightly inflated. The Safe Approach, which used all data in the LRS-exposed group and only included cohorts with all lifestyle factors available in the LRS-unexposed group, and the Moderator Approach, which handled missingness via moderator meta-regression, were both slightly conservative and yielded almost identical p values. We also evaluated the performance of the Safe Approach under different scenarios. We observed that the larger the proportion of cohorts without missingness included, the more accurate the results compared to "benchmark" results. In conclusion, we generally recommend the Safe Approach, a straightforward and non-inflated approach, to handle heterogeneity among cohorts in LRS-based genome-wide interaction meta-analyses.


Subject(s)
Cardiometabolic Risk Factors , Gene-Environment Interaction , Genome-Wide Association Study/methods , Hypertension/genetics , Obesity/genetics , Genome-Wide Association Study/standards , Healthy Lifestyle , Humans , Hypertension/epidemiology , Obesity/epidemiology
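Illustrative sketch for record 20 (not part of the indexed record). The abstract does not state which pooling model the authors used. Assuming a standard inverse-variance fixed-effect meta-analysis, the contrast between the Naïve Approach (all cohorts) and the Complete Approach (only cohorts with every lifestyle component) can be shown as below. The cohort estimates are hypothetical, and the Safe and Moderator Approaches are not shown.

```python
import math


def fixed_effect_meta(results):
    """Inverse-variance-weighted fixed-effect pooling of (beta, se) pairs."""
    weights = [1.0 / se ** 2 for _, se in results]
    total = sum(weights)
    beta = sum(w * b for w, (b, _) in zip(weights, results)) / total
    se = math.sqrt(1.0 / total)
    return beta, se


# Hypothetical cohorts: (interaction estimate, standard error, all LRS components measured?)
cohorts = [
    (0.20, 0.10, True),
    (0.40, 0.10, True),
    (0.30, 0.05, False),
]

# Naive: pool every cohort and ignore the missing lifestyle components.
naive = fixed_effect_meta([(b, s) for b, s, _ in cohorts])
# Complete: pool only cohorts in which every lifestyle component was measured.
complete = fixed_effect_meta([(b, s) for b, s, ok in cohorts if ok])

print("naive   ", naive)
print("complete", complete)
```

In this toy input the Complete pooling has the larger standard error because it discards a cohort, which mirrors the loss of power the abstract reports for that approach.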